Abstract — Fast convergence is a desirable property when training latent Dirichlet allocation (LDA), especially in online and parallel
topic modeling for massive data sets. This paper presents a novel residual belief propagation (RBP) algorithm that accelerates
convergence when training LDA. The proposed RBP uses an informed scheduling scheme for asynchronous message passing,
which passes fast-convergent messages with a higher priority so that they influence the slow-convergent messages at each learning iteration.
Extensive empirical studies confirm that RBP significantly reduces the training time until convergence while achieving a much lower
predictive perplexity than other state-of-the-art training algorithms for LDA, including variational Bayes (VB), collapsed Gibbs sampling
(GS), loopy belief propagation (BP), and residual VB (RVB).

Index Terms — Latent Dirichlet allocation, topic models, residual belief propagation, Gibbs sampling, variational Bayes. 



1 Introduction 

Probabilistic topic modeling [1] is an important problem in machine learning and data mining.
As one of the simplest topic models, latent Dirichlet allocation (LDA) [2] requires multiple
iterations of training until convergence. Recent studies find that the convergence speed
determines the efficiency of topic modeling for massive data sets. For example, online topic
modeling algorithms partition the entire data set into mini-batches and optimize each mini-batch
sequentially until convergence. Another example is parallel topic modeling algorithms [4], which
optimize the distributed data sets until convergence and then synchronize the global topic
distributions. Therefore, faster convergence leads to faster online and parallel topic modeling
algorithms for massive data sets.

Training algorithms for LDA can be broadly categorized into variational Bayes (VB) [2], collapsed
Gibbs sampling (GS) [5] and loopy belief propagation (BP) [6]. According to a recent comparison [6],
VB requires around 100 iterations, GS takes around 300 iterations and synchronous BP (sBP) needs
around 170 iterations to achieve convergence in terms of training perplexity. Although VB uses the
fewest iterations to achieve convergence, its digamma function computation is so time-consuming
that it slows down the overall convergence speed [6], [7]. GS is a stochastic Markov chain Monte
Carlo (MCMC) process, which in practice takes more iterations to converge. In contrast, sBP is a
deterministic scheme that needs fewer iterations until convergence than GS. Moreover, sBP does not
involve complicated digamma functions, and thus converges faster in training time than both VB and GS.

In this paper, we further adopt a residual belief propagation (RBP) [8] algorithm to accelerate
the convergence of topic modeling. Compared with sBP, RBP uses an informed scheduling strategy for
asynchronous message passing, in which it efficiently influences the slow-convergent messages by
passing fast-convergent messages with a higher priority. By dynamically scheduling the order of
message passing based on the residuals between messages from successive iterations, RBP in theory
converges significantly faster and more often than sBP [8]. The novelty of this paper is to
introduce RBP into the probabilistic topic modeling community, which significantly speeds up
convergence when training LDA. Although moving from sBP to RBP is a simple and straightforward
idea, extensive experimental results demonstrate that RBP in most cases converges fastest while
reaching the lowest predictive perplexity when compared with other state-of-the-art training
algorithms, including VB [2], GS [5], sBP [6], and residual VB (RVB) [9], [10]. Because of its ease
of use and fast convergence, RBP is a strong candidate for becoming the standard LDA training
algorithm, and may inspire faster online and parallel topic modeling algorithms in the near future.

2 Related Work 

Recently, LDA [2] has seen rapid development for solving various topic modeling problems, because
of its elegant three-layer graphical representation as well as two efficient approximate inference
methods, VB [2] and GS [5]. Both VB and GS had been widely used to learn LDA-based topic models
until our recent work [6] revealed that there is yet another learning algorithm for LDA based on
BP. Extensive experiments show that synchronous BP (sBP) is faster and more accurate than both VB
and GS, and has the potential to become a generic learning scheme for LDA-based topic models. The
basic idea of sBP is inspired by the factor graph [11] representation for LDA within the Markov
random field (MRF) framework. Similar BP ideas have also been proposed within the approximate
mean-field framework [12] as the zero-order approximation of collapsed VB (CVB0) algorithms [7].

VB, GS and sBP can be explained within a unified message passing framework. All these algorithms
infer the marginal distribution of the topic label for each word index, called the message, and
estimate parameters by the iterative expectation-maximization (EM) algorithm according to the
maximum-likelihood criterion [13]. They mainly differ in the message update equations used in the
E-step of the EM algorithm. VB is a synchronous variational message passing algorithm [14], which
updates variational messages by complicated digamma functions, introducing biases and slowing down
the training speed [6], [7]. GS updates messages by discrete topic labels randomly sampled from the
message in the previous iteration. The sampling process therefore does not keep all the uncertainty
encoded in the previous message. Also, such stochastic message updating often requires more
iterations until convergence. By contrast, sBP directly uses the previous messages to update the
current messages without sampling. Such a deterministic process often takes fewer iterations than
GS to achieve convergence.

Similar to the proposed RBP, residual VB (RVB) algorithms for LDA [9], [10] have also been proposed
from a matrix factorization perspective. Because VB is by nature a synchronous message passing
algorithm, it does not have a direct asynchronous residual counterpart. So, RVB is derived from
online VB (OVB) algorithms [3], which divide the entire set of documents into mini-batches. Through
dynamically scheduling the order of mini-batches based on residuals, RVB is often faster than OVB
to achieve the same training perplexity. There are, however, several major differences between RVB
and the proposed RBP. First, they are derived from different algorithms, OVB and sBP, respectively.
While OVB can converge to VB's objective function, in practice it involves complicated digamma
functions, which introduce biases and slow down training [6], [7]. Second, RVB randomly generates a
subset of mini-batches from a complicated residual distribution for training, while the proposed
RBP simply sorts residuals in descending order for either documents or vocabulary words. Notice
that the random sampling process often misses the important mini-batches with the largest
residuals, whereas the sorting technique is guaranteed to locate the top documents or vocabulary
words with the largest residuals. Because larger residuals correspond to more efficient updates
[9], [10], our simple sorting technique in RBP is more efficient than the random sampling strategy
in RVB. This is one of the major reasons that RBP is faster than RVB. Finally, RBP often achieves a
much lower predictive perplexity than RVB, partly because the digamma functions in RVB introduce
biases in parameter estimation.

3 Residual Belief Propagation 

In this section, we first introduce the conventional sBP algorithm for training LDA [6]. From the
Markov random field (MRF) perspective, the probabilistic topic modeling task can be interpreted as
a labeling problem. We assign a set of thematic topic labels, $\mathbf{z} = \{z_{w,d}^{k}\}$, to
explain the nonzero elements in the document-word matrix, $\mathbf{x} = \{x_{w,d}\}$, where
$1 \le w \le W$ and $1 \le d \le D$ are the word index in the vocabulary and the document index in
the corpus. The notation $1 \le k \le K$ is the topic index. The nonzero element $x_{w,d} \neq 0$
denotes the number of word counts at the index $\{w, d\}$. The topic label satisfies
$\sum_{k} z_{w,d}^{k} = 1$, where $z_{w,d}^{k} \in \{0, 1\}$. To maximize the joint probability
$p(\mathbf{x}, \mathbf{z} \mid \alpha, \beta)$ of LDA, the sBP algorithm computes the conditional
marginal message, $\mu(z_{w,d}^{k} = 1) = \mu_{w,d}(k)$, which can be normalized using a local
computation, i.e., $\sum_{k=1}^{K} \mu_{w,d}(k) = 1$, $0 \le \mu_{w,d}(k) \le 1$. The message is
proportional to the product of its neighboring messages, which in the sBP formulation of [6] takes
the form

$$\mu_{w,d}(k) \propto \frac{\mu_{-w,d}(k) + \alpha}{\sum_{k} [\mu_{-w,d}(k) + \alpha]} \times \frac{\mu_{w,-d}(k) + \beta}{\sum_{w} [\mu_{w,-d}(k) + \beta]}, \tag{1}$$

where $\mu_{-w,d}(k) = \sum_{-w} x_{-w,d}\,\mu_{-w,d}(k)$ and
$\mu_{w,-d}(k) = \sum_{-d} x_{w,-d}\,\mu_{w,-d}(k)$, and $-w$ and $-d$ denote all word indices
except $w$ and all document indices except $d$. Based on messages, the multinomial parameters
$\theta$ and $\phi$ can be estimated as

$$\theta_d(k) = \frac{\mu_{\cdot,d}(k) + \alpha}{\sum_{k} [\mu_{\cdot,d}(k) + \alpha]}, \tag{2}$$

$$\phi_w(k) = \frac{\mu_{w,\cdot}(k) + \beta}{\sum_{w} [\mu_{w,\cdot}(k) + \beta]}, \tag{3}$$

with $\mu_{\cdot,d}(k) = \sum_{w} x_{w,d}\,\mu_{w,d}(k)$ and
$\mu_{w,\cdot}(k) = \sum_{d} x_{w,d}\,\mu_{w,d}(k)$.

In sBP, the synchronous schedule updates all messages simultaneously at iteration $t$ based on the
messages at the previous iteration $t-1$. Although in practice this schedule often converges, it
often needs more training iterations until convergence than VB [6].
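As a rough illustration of the synchronous schedule, the following NumPy sketch performs one sBP sweep on a small dense count matrix. The function name `sbp_sweep`, the dense `(W, D, K)` message array and the toy data are assumptions made for illustration; the update follows the general form used for sBP in [6] and is not the MEX implementation used in the experiments.

```python
import numpy as np

def sbp_sweep(x, mu, alpha, beta):
    """One synchronous sweep: every new message is computed from the old ones.

    x  : (W, D) word counts.
    mu : (W, D, K) messages, each normalized over the K topics.
    """
    W = x.shape[0]
    weighted = x[:, :, None] * mu            # x_{w,d} * mu_{w,d}(k)
    doc_tot = weighted.sum(axis=0)           # (D, K): summed over words
    word_tot = weighted.sum(axis=1)          # (W, K): summed over documents
    topic_tot = weighted.sum(axis=(0, 1))    # (K,):   summed over both

    # mu_{-w,d}(k) + alpha: document total without the current word.
    doc_part = doc_tot[None, :, :] - weighted + alpha
    # mu_{w,-d}(k) + beta: word total without the current document.
    word_part = word_tot[:, None, :] - weighted + beta
    # sum_w [mu_{w,-d}(k) + beta]: topic total without the current document.
    word_norm = topic_tot[None, None, :] - doc_tot[None, :, :] + W * beta

    # The document-side denominator does not depend on k, so the final
    # normalization over k absorbs it.
    new_mu = doc_part * word_part / word_norm
    new_mu /= new_mu.sum(axis=2, keepdims=True)
    return new_mu

rng = np.random.default_rng(0)
x = rng.integers(0, 3, size=(5, 4))          # toy data: W = 5, D = 4
mu = rng.random((5, 4, 3))                   # K = 3 topics
mu /= mu.sum(axis=2, keepdims=True)
mu = sbp_sweep(x, mu, alpha=0.01, beta=0.01)
```

Because `new_mu` is built entirely from the old `mu`, no message sees an update made earlier in the same sweep, which is what distinguishes this schedule from the asynchronous one.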

The asynchronous schedule updates the message of each variable in a certain order, and each
updated message is in turn used to update other neighboring messages immediately within the same
iteration $t$. The basic idea of RBP [8] is to select the best updating order based on the
messages' residuals $r_{w,d}$, which are defined as the $p$-norm of the difference between two
message vectors at successive iterations,

$$r_{w,d} = x_{w,d} \left\| \mu_{w,d}^{t} - \mu_{w,d}^{t-1} \right\|_{p}, \tag{4}$$

where $x_{w,d}$ is the number of word counts. Here, we choose the $L_1$ norm with $p = 1$. If we
sequentially update messages in descending order of $r_{w,d}$ at each iteration, the RBP algorithm
theoretically converges faster or more often to a fixed point than sBP. Because we extend sBP
algorithms to classical RBP algorithms, the theoretical proof of RBP's fast convergence rate
remains the same as that in [8].
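The residual in Eq. (4) with $p = 1$ can be sketched in a few lines of NumPy. The helper name `message_residuals` and the toy arrays are illustrative assumptions, and the dense layout is chosen only for readability.

```python
import numpy as np

def message_residuals(x, mu_new, mu_old):
    """r_{w,d} = x_{w,d} * ||mu_new_{w,d} - mu_old_{w,d}||_1 (Eq. (4), p = 1)."""
    return x * np.abs(mu_new - mu_old).sum(axis=2)

# Toy example with W = 2 words, D = 2 documents and K = 2 topics.
x = np.array([[2, 0],
              [1, 3]])
mu_old = np.full((2, 2, 2), 0.5)
mu_new = np.array([[[0.9, 0.1], [0.5, 0.5]],
                   [[0.6, 0.4], [0.5, 0.5]]])

r = message_residuals(x, mu_new, mu_old)
# Flattened (w, d) positions, largest residual first.
order = np.argsort(-r.ravel(), kind="stable")
```

Entries with a zero count or an unchanged message get a zero residual and therefore fall to the end of the updating order.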
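input: $\mathbf{x}, K, T, \alpha, \beta$.
output: $\theta_d, \phi_w$.

$\mu_{w,d}^{1}(k) \leftarrow$ random initialization and normalization;
$W^{1} \leftarrow$ random order;
for $t \leftarrow 1$ to $T$ do
    for $w \in W^{t}$ do
        for $d \leftarrow 1$ to $D$, $x_{w,d} \neq 0$ do
            $\mu_{w,d}^{t+1}(k) \propto$ product of the neighboring messages of $\mu_{w,d}(k)$;
            $\mu_{w,d}^{t+1}(k) \leftarrow \text{normalize}(\mu_{w,d}^{t+1}(k))$;
            $r_{w}^{t+1} \leftarrow r_{w}^{t+1} + x_{w,d} \sum_{k} |\mu_{w,d}^{t+1}(k) - \mu_{w,d}^{t}(k)|$;
        end
    end
    $W^{t+1} \leftarrow$ insertion sort($r_{w}^{t+1}$, 'descending');
end
$\theta_d(k) \leftarrow [\mu_{\cdot,d}(k) + \alpha] / \sum_{k} [\mu_{\cdot,d}(k) + \alpha]$;
$\phi_w(k) \leftarrow [\mu_{w,\cdot}(k) + \beta] / \sum_{w} [\mu_{w,\cdot}(k) + \beta]$;

Fig. 1. The RBP algorithm for LDA.

TABLE 1
Statistics of six document data sets.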



In practice, the computational cost of sorting (4) is very high because we need to sort all
nonzero residuals $r_{w,d}$ in the document-word matrix at each learning iteration. This
scheduling cost is expensive for large-scale data sets. Alternatively, we may accumulate residuals
based on either document or vocabulary indices,

$$r_{d} = \sum_{w} r_{w,d}, \tag{5}$$

$$r_{w} = \sum_{d} r_{w,d}. \tag{6}$$

These residuals can be computed during the message passing process at a negligible computational
cost. For large-scale data sets, we advocate (6) because the vocabulary size is often a fixed
number $W$ independent of the number of documents $D$. So, initially sorting $r_w$ requires a
computational complexity of $O(W \log W)$ on average using the standard quick sort algorithm. If
the successive residuals are in almost sorted order, only a few swaps will restore the sorted order
by the standard insertion sort algorithm, thereby saving time. In our experiments (not shown in
this paper), RBP based on (6) spends little computational cost on sorting $r_w$ while retaining
almost the same convergence rate as that of sorting (4). We see that Eq. (5) is also useful for
small-scale data sets, because in this case $D < W$ as shown in Table 1.
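The benefit of insertion sort on nearly sorted residuals can be seen in a small sketch. The function `insertion_sort_desc` and the toy residual values are hypothetical, and Python's built-in `sorted` stands in for the initial quick sort.

```python
def insertion_sort_desc(order, residual):
    """Sort the index list `order` in place by descending residual.

    Returns the number of element shifts, which stays small when
    `order` is already almost sorted.
    """
    shifts = 0
    for i in range(1, len(order)):
        item = order[i]
        j = i - 1
        # Move entries with a smaller residual one slot to the right.
        while j >= 0 and residual[order[j]] < residual[item]:
            order[j + 1] = order[j]
            j -= 1
            shifts += 1
        order[j + 1] = item
    return shifts

# Word-level residuals r_w from two successive iterations (toy values).
r_prev = [0.9, 0.1, 0.5, 0.3]
r_next = [0.8, 0.2, 0.25, 0.3]

order = sorted(range(len(r_prev)), key=lambda w: -r_prev[w])  # initial full sort
shifts = insertion_sort_desc(order, r_next)                   # cheap re-sort
```

Only words 2 and 3 trade places between the two iterations, so a single shift restores the descending order instead of a full re-sort.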

Fig. 1 summarizes the proposed RBP algorithm based on (6), which will be used in the following
experiments. First, we initialize messages randomly and normalize them locally. Second, we start
with a random order $w \in W^{1}$ and accumulate residuals during message updating. At the end of
each learning iteration $t$, we sort $r_{w}^{t+1}$ in descending order to refine the updating order
$w \in W^{t+1}$. Intuitively, residuals reflect the convergence speed of message updating: larger
residuals correspond to faster-convergent messages. In the successive learning iterations, RBP
always starts by passing fast-convergent messages with a higher priority in the order $W^{t+1}$.
Because asynchronous message passing lets each message update influence the updates that follow it,
passing fast-convergent messages first speeds up the convergence of the slow-convergent messages.
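The steps summarized above can be sketched as a compact, dense NumPy routine. This is a minimal illustration under stated assumptions, not the implementation used in the experiments: the name `rbp_lda`, the dense message array, and the use of `np.argsort` in place of insertion sort are all choices made here for brevity, and the message update follows the general sBP form of [6].

```python
import numpy as np

def rbp_lda(x, K, T, alpha=0.01, beta=0.01, seed=0):
    """Dense toy sketch of RBP for LDA with word-level residual scheduling."""
    rng = np.random.default_rng(seed)
    W, D = x.shape
    mu = rng.random((W, D, K))
    mu /= mu.sum(axis=2, keepdims=True)        # random init, local normalization
    weighted = x[:, :, None] * mu              # x_{w,d} * mu_{w,d}(k)
    doc_tot = weighted.sum(axis=0)             # (D, K) document-topic totals
    word_tot = weighted.sum(axis=1)            # (W, K) word-topic totals
    topic_tot = word_tot.sum(axis=0)           # (K,)  overall topic totals
    order = rng.permutation(W)                 # W^1: random order
    r = np.zeros(W)
    for _ in range(T):
        r = np.zeros(W)                        # reset word-level residuals r_w
        for w in order:
            for d in np.flatnonzero(x[w]):     # only nonzero counts carry messages
                old = mu[w, d].copy()
                own = x[w, d] * old            # this entry's own contribution
                new = ((doc_tot[d] - own + alpha) * (word_tot[w] - own + beta)
                       / (topic_tot - doc_tot[d] + W * beta))
                new /= new.sum()
                delta = x[w, d] * (new - old)
                # Asynchronous schedule: totals change immediately, so later
                # updates in this sweep already see the new message.
                doc_tot[d] += delta
                word_tot[w] += delta
                topic_tot += delta
                mu[w, d] = new
                r[w] += x[w, d] * np.abs(new - old).sum()
        order = np.argsort(-r, kind="stable")  # descending residuals give W^{t+1}
    theta = (doc_tot + alpha) / (doc_tot + alpha).sum(axis=1, keepdims=True)
    phi = (word_tot + beta) / (word_tot + beta).sum(axis=0, keepdims=True)
    return theta, phi, r

rng = np.random.default_rng(1)
x = rng.integers(0, 4, size=(6, 5))            # toy data: W = 6, D = 5
theta, phi, r = rbp_lda(x, K=3, T=5)
```

Here `theta` has one row per document and `phi` one row per vocabulary word; a practical implementation would store messages only for the nonzero entries of `x`.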

4 Experimental Results 

We carry out experiments on six publicly available data sets: 1) 20 newsgroups (NG20), 2) BLOG,
3) CORA, 4) MEDLINE, 5) NIPS, and 6) WEBKB. Table 1 summarizes the statistics of the six data sets,
where $D$ is the total number of documents in the corpus, $W$ is the number of words in the
vocabulary, $N_d$ is the average number of word tokens per document, and $W_d$ is the average
number of word indices per document. All subsequent figures show results on the six data sets in
the above order. We compare RBP with three state-of-the-art approximate inference methods for LDA,
including VB [2], GS [5], and sBP [6], under the same fixed hyperparameters $\alpha = \beta = 0.01$.
We use MATLAB C/C++ MEX implementations for all these algorithms [15], and carry out the
experiments on a common PC with a 2.4 GHz CPU and 4 GB of RAM.

Fig. 2 shows the training perplexity at every 10 iterations over 1000 iterations when $K = 10$ for
each data set. All algorithms converge to a fixed point of training perplexity within 1000
iterations. Except on the NIPS set, VB always converges at the highest training perplexity. In
addition, GS converges at a higher perplexity than both sBP and RBP. While RBP converges at almost
the same training perplexity as sBP, it always reaches the same perplexity value faster than sBP.
Generally, a training algorithm is considered to have converged when the difference in training
perplexity between two consecutive iterations is below a threshold. In this paper, we set the
convergence threshold to 1 because the training perplexity decreases very little after this
threshold is satisfied in Fig. 2.
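The stopping rule described above can be sketched as follows. The perplexity formula is the standard definition for topic models and is an assumption here, since this paper does not state it explicitly; the function names and array shapes are likewise illustrative.

```python
import numpy as np

def training_perplexity(x, theta, phi):
    """exp(-sum_{w,d} x_{w,d} log sum_k theta_d(k) phi_w(k) / sum_{w,d} x_{w,d}).

    x : (W, D) counts, theta : (D, K), phi : (W, K).
    """
    word_prob = phi @ theta.T                 # (W, D): probability of word w in document d
    mask = x > 0                              # zero counts contribute nothing
    log_lik = (x[mask] * np.log(word_prob[mask])).sum()
    return float(np.exp(-log_lik / x.sum()))

def has_converged(perplexities, threshold=1.0):
    """True once the last two perplexity values differ by less than `threshold`."""
    return (len(perplexities) >= 2
            and abs(perplexities[-1] - perplexities[-2]) < threshold)

# Sanity check: with uniform parameters the perplexity equals the vocabulary size.
W, D, K = 4, 3, 2
x = np.ones((W, D))
p = training_perplexity(x, np.full((D, K), 1.0 / K), np.full((W, K), 1.0 / W))
```

In a training loop, the perplexity would be appended to a list after each iteration and `has_converged` checked with the threshold of 1 used in this paper.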

Fig. 2. Training perplexity as a function of number of iterations when K = 10.

Fig. 3. Number of training iterations until convergence as a function of number of topics.

Fig. 4. Training time until convergence as a function of number of topics.

Fig. 5. Predictive perplexity for ten-fold cross-validation when K = 50.

Fig. 3 illustrates the number of training iterations until convergence on each data set for
different numbers of topics $K \in \{10, 20, 30, 40, 50\}$. The number of iterations until
convergence seems insensitive to the number of topics. On the BLOG, CORA and WEBKB sets, VB uses
the minimum number of iterations until convergence, consistent with the previous results in [6].
For all data sets, GS consumes the maximum number of iterations until convergence. Unlike the
deterministic message updating in VB, sBP and RBP, GS uses a stochastic message updating scheme,
which accounts for the largest number of iterations until convergence. Although sBP needs
significantly fewer iterations until convergence than GS, it still uses many more iterations than
VB. By contrast, through the informed dynamic scheduling for asynchronous message passing, RBP on
average converges more rapidly than sBP for all data sets. In particular, on the NG20, MEDLINE and
NIPS sets, RBP on average uses a comparable or even smaller number of iterations than VB until
convergence. Fig. 4 shows the training time in seconds until convergence on each data set for
different numbers of topics $K \in \{10, 20, 30, 40, 50\}$.



Surprisingly, while VB usually uses the minimum number of iterations until convergence, it often
consumes the longest training time for these iterations. The major reason may be the
time-consuming digamma functions in VB, which make each iteration take at least three times as
long as in GS and sBP. If VB removed the digamma functions, it would run as fast as sBP. Because
RBP uses significantly fewer iterations until convergence than GS and sBP, it consumes the least
training time until convergence for all data sets in Fig. 4.

We also examine the predictive perplexity of all algorithms until convergence based on a ten-fold
cross-validation. The predictive perplexity for the unseen test set is computed as in [1]. Fig. 5
shows the box plot of predictive perplexity for ten-fold cross-validation when $K = 50$. The plot
produces a separate box for the ten predictive perplexity values of each algorithm. On each box,
the central mark is the median, the edges of the box are the 25th and 75th percentiles, the
whiskers extend to the most extreme data points not considered outliers, and outliers are plotted
individually by a red plus sign. VB clearly yields the highest predictive perplexity, corresponding
to the worst generalization ability. GS has a much lower predictive perplexity than VB, but a much
higher perplexity than both sBP and RBP. The underlying reason is that GS samples a topic label
from the messages without retaining all possible uncertainties. The residual-based scheduling
scheme of RBP not only speeds up the convergence rate of sBP, but also slightly lowers the
predictive perplexity. The reason is that RBP updates fast-convergent messages to efficiently
influence the slow-convergent messages, quickly reaching a local minimum of the predictive
perplexity. Figs. 6 and 7 illustrate the box plots for the number of iterations and the training
time until convergence for ten-fold cross-validation when $K = 50$. Consistent with Figs. 3 and 4,
VB consumes the minimum number of iterations, but has the longest training time until convergence.
GS has the maximum number of iterations, but has the second longest training time until
convergence. Because RBP improves the convergence rate over sBP, it consumes the least training
time until convergence.

Fig. 6. Number of training iterations until convergence for ten-fold cross-validation when K = 50.

Fig. 7. Training time until convergence for ten-fold cross-validation when K = 50.

To measure the interpretability of inferred topics, Fig. 8 shows the top ten words of each topic
when $K = 10$ on the CORA set using 500 training iterations. We observe that both sBP and RBP infer
almost the same topics as the other algorithms except for topic one, where sBP identifies the
"pattern recognition" topic but RBP infers the "parallel system" topic. It seems that both sBP and
RBP obtain slightly more interpretable topics than GS and VB, especially in topic four, where
"reinforcement learning" is closely related to "control systems". For the other topics, we find
that they often share similar top ten words but with different ranking orders. More details on
subjective evaluation of topic interpretability can be found in [16]. However, even if GS and VB
yield topics as interpretable as those of RBP, we still advocate RBP because it consumes less
training time until convergence while reaching a much lower predictive perplexity value.

We also compare RBP with other residual-based techniques for training LDA such as RVB [9], [10].
It is not easy to make a fair comparison because RBP is an offline learning algorithm whereas RVB
is an online learning algorithm. However, using the same data sets, WEBKB and NG20, we can
approximately compare RBP with RVB using the training time when the predictive perplexity
converges. When $K = 100$, RVB converges at a predictive perplexity of 600 using 60 seconds of
training time on WEBKB, and at a predictive perplexity of 1050 using 600 seconds on NG20. With the
same experimental settings as RVB (hyperparameters $\alpha = \beta = 0.01$), RBP achieves a
predictive perplexity of 540 using 35 seconds of training on WEBKB, and a predictive perplexity of
1004 using 420 seconds on NG20. The significant speedup arises because RVB involves relatively slow
digamma function computation and adopts a more complicated sampling method based on residual
distributions for dynamic scheduling.
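For reference, the relative savings implied by the numbers quoted in this comparison can be worked out directly. This is only arithmetic on the reported values, not an additional measurement.

```latex
\begin{align*}
\text{WEBKB training time: } & \frac{60 - 35}{60} \approx 41.7\%, &
\text{NG20 training time: } & \frac{600 - 420}{600} = 30\%, \\
\text{WEBKB perplexity: } & \frac{600 - 540}{600} = 10\%, &
\text{NG20 perplexity: } & \frac{1050 - 1004}{1050} \approx 4.4\%.
\end{align*}
```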

Fig. 8. Top ten words of K = 10 topics for GS (blue), VB (red), sBP (green) and RBP (black) on CORA set.

5 Conclusions

This paper presents a simple but effective RBP algorithm for training LDA. Through the
residual-based dynamic scheduling scheme, RBP significantly improves the convergence rate of sBP
while adding only an affordable scheduling cost for large-scale data sets. On average, it saves
around 50 to 100 training iterations until convergence, while achieving a relatively lower
predictive perplexity than sBP. For the ten-fold cross-validation on six publicly available
document sets when $K = 50$, RBP on average reduces the training time until convergence by 63.7%
and 85.1% compared with the two widely used GS and VB algorithms, respectively. Meanwhile, it
achieves on average 8.9% and 22.1% lower predictive perplexity than GS and VB, respectively.
Compared with other residual techniques like RVB, RBP reduces the training time by around 30% to
50% while achieving a lower predictive perplexity. While RBP is a simple extension of sBP [6] that
introduces dynamic scheduling for message passing, its theoretical basis [8] and strong
experimental results support its promising role in the probabilistic topic modeling field.